SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)

Authors

  • Jöran Beel
  • Bela Gipp
  • Ammar Shaker
  • Nick Friedrich
Abstract

Extracting titles from a PDF’s full text is an important task in information retrieval for identifying PDFs. Existing approaches apply complicated and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that considers style information (font size) to identify a PDF’s title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an ‘academic search engine’ scenario, as well as better run times (8:19 minutes vs. 57:26 minutes).
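
To make the idea of a font-size heuristic concrete, here is a minimal sketch in Python. The abstract states only that the heuristic is rule based and uses font size; everything else here is an assumption for illustration: the restriction to the first page, the `size_tolerance` and `min_chars` parameters, and the names `TextLine` and `guess_title` are all hypothetical. It also assumes that text lines and their font sizes have already been extracted from the PDF by some other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One extracted line of text with its font size (hypothetical structure)."""
    text: str
    font_size: float
    page: int = 1


def guess_title(lines: List[TextLine],
                size_tolerance: float = 0.1,
                min_chars: int = 4) -> str:
    """Return the first run of largest-font lines on page 1 as the title guess.

    Assumption: the title is the largest text on the first page. This is an
    illustrative rule, not necessarily the exact rule set used by the authors.
    """
    # Keep only first-page lines that are long enough to be part of a title.
    candidates = [ln for ln in lines
                  if ln.page == 1 and len(ln.text.strip()) >= min_chars]
    if not candidates:
        return ""

    largest = max(ln.font_size for ln in candidates)

    # Collect the first consecutive run of lines set in the largest font size,
    # so that multi-line titles are joined into one string.
    title_parts: List[str] = []
    for ln in candidates:
        if abs(ln.font_size - largest) <= size_tolerance:
            title_parts.append(ln.text.strip())
        elif title_parts:
            break
    return " ".join(title_parts)


# Usage example with made-up sample data.
sample = [
    TextLine("Proceedings of a Conference", 9.0),
    TextLine("SciPlore Xtract: Extracting Titles", 18.0),
    TextLine("from Scientific PDF Documents", 18.0),
    TextLine("Jöran Beel, Bela Gipp", 11.0),
    TextLine("Abstract", 12.0),
]
title = guess_title(sample)
```

A rule like this needs no training data and only a single pass over the first page, which is consistent with the run-time advantage the abstract reports over a trained classifier.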

Similar resources

Enhanced Techniques for PDF Image Segmentation and Text Extraction

Extracting text objects from the PDF images is a challenging problem. The text data present in the PDF images contain certain useful information for automatic annotation, indexing etc. However variations of the text due to differences in text style, font, size, orientation, alignment as well as complex structure make the problem of automatic text extraction extremely difficult and challenging j...

A Document Engineering Approach to Automatic Extraction of Shallow Metadata from Scientific Publications

Semantic metadata can be considered one of the foundational blocks of the Semantic Web and Desktop. This report describes a solution for automatic metadata extraction from scientific publications, published as PDF documents. The proposed algorithms follow a low-level document engineering approach, by combining mining and analysis of the publications’ text based on its formatting style and font ...

Modeling Reader's Emotional State Response on Document's Typographic Elements

We present the results of an experimental study towards modeling the reader’s emotional state variations induced by the typographic elements in electronic documents. Based on the dimensional theory of emotions we investigate how typographic elements, like font style (bold, italics, bold-italics) and font (type, size, color and background color), affect the reader’s emotional states, namely, Ple...

A Survey of Indexing and Retrieval of Multimodal Documents: Text and Images

A document conveys information using multiple modalities, including text, layout/style and images. For example, journal articles usually have figures to illustrate experimental results, and the title in a journal article usually has a different font size than the body text. Indexing and retrieval using only text is the traditional way of IR (Information Retrieval). With the development of the I...

Potential angiotensin converting enzyme (ACE) inhibitors from Iranian traditional plants described by Avicenna’s Canon of Medicine

Objective: Hypertension is an important cause of cardiovascular disorders. The angiotensin converting enzyme (ACE) plays an important role in hypertension; therefore, inhibition of ACE in treatment of chronically elevated blood pressure is an important therapeutic approach. In the current review, we have p...

Publication date: 2010